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Abstract. We present a gravitational hierarchical A^-body code that is designed to 
run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are 
exectued on the GPU which eliminates the need for data transfer between the Central 
Processing Unit (CPU) and the GPU. Our tests indicate that the gravitational tree-code 
outperforms tuned CPU code for all parts of the algorithm and show an overall perfor- 
mance improvement of more than a factor 20, resulting in a processing rate of more 
than 2.8 million particles per second. 



1. Introduction 

A populair method for simulating gravi tational jV-body syst ems is the hierarchical tree- 
code algorithm orginaly introduced by iBarnes & Hut (1986). This method reduces the 
computational complexity of the simulation from 0(N 2 ) to 0(N log N) per crossing 
time. The former, though computationally expensive, can easily be implemented in 
parallel for many particles. The later requires more communication and book keep- 
ing when developing a parallel method. Still for large number of particles (N > 10 5 ) 
hierarchical methods are more efficient than brute force methods. Currently parallel 
octree implementations are found in a wide range of problems, such as: self gravitat- 
ing systems, smoothed particle hydrodynamics, molecular dynamics, clump finding, 
ray tracing and voxel rendering. All of these problems require a high amount of com- 
putation time. For high resolution simulations (N > 10 5 ) 1 Central Processing Unit 
(CPU) is not sufficient, one has to use computer clusters or even supercomputers, both 
of which are expensive and scarce. A GPU provides an atractive alternative to such 
systems. 

The GPU is a massively parallel processor which is specialised in perf orming inde- 



pende n d parallel computa t ions. For an overview and mor e details se e Portegies Z wart et al 
(120071) : lHamada & Iitakal (120071) : iBelleman et all d2008l) : iGaburov et all (l2009h . In this 



work we have implemented all the seperate parts of the Barnes-Hut tree-code algorithm 
on the GPU. This includes the tree-construction and computation of the tree-properties 
(multipole moments). By doing so we remove the need to communicate large amounts 
of data between the CPU and GPU. Since there is no time lost by CPU-GPU com- 



1 Tree data-structures are commonly referred to as hierarchical data-structures. In this work we use an 
octree data-structure. 
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munication w e can make optimal u s e of the GPU in a b l ock t ime-step algorithm. In 
previous work Ga burov et al.l ( 2010h : lHamada & NitadorilfcOlOb parts of the algorithm 



were executed on the CPU which only allows for efficient (parallel) execution if shared 
time-steps are used. 

Full implementa tion details, performance and accuracy characteristics can be found 
in lBedorf etall(l2012h 



2. Results 

The algorithms are implemented as part of the gravitational A-body code Bonsaifl We 
compare the performance of the GPU algorithms with (optimized) CPU implementa- 
tions of comparable algorithms. The performance of the tree -traverse depends critically 
on the multipole acceptance critera (MAC) (0) which sets the trade-off between speed 
and accuracy. In our implementation we use a combination of the method introd uced by 
Barne s (1994 ) and the method used for tree-traversion on vector machines see iBarnesI 
( 19941 . [l990h . To compare the CPU and GPU implementations we measure the wall- 



clock time for the most time critical parts of the algorithms. For the tree-cons truction 
we dis tinguish three parts; Sorting of the particles along a Space Filling Curve dMorton 



( 1966)) (sorting in the figure) , reordering of particle properties based on the sort (mov- 
ing) and construction of the tree-datastructure (tree-construction). Furthermore, timings 
are presented for the computation of the multipole moments and tree-traverse. As for 
hardware we used a Xeon E5620 CPU with 4 physical cores and a NVIDIA GTX480 
GPU. The resuls are presented in Fig. [T] 

To measure the sc aling of the impl emented algorithms we execute simulations us- 
ing Plummer spheres f Plummerl (Il915h ) with N = 2 15 ( 32k) up to N = 2 22 ( 4M) 



particles. We measure the performance of the same algorithms as in the previous para- 
graph. The results can be found in Fig. |2] The wall-clock time spent in the sorting, 
moving, tree-construction and multipole computation algorithms scales linearly with N 
for N > 10 6 . For smaller N, however, the scaling is sub-linear, because the algorithms 
require more than 10 5 particles to saturate the GPU. In theory the tree-traverse scales 
as 0(N log N), whereas empirically the wall-clock time scales almost linearly with N. 
This is explained in the inset of Fig. |2j which shows the average number of interactions 
per particle during the simulation. The average number of particle-cell interactions 
doubles between N > 32k and N < 1M and keeps gradually increasing for N > 1M. 
This break is not clearly visible in the timing results since for small particle numbers 
(N < 1M ) not all GPU resources are satured. Finally, more than 90% of the wall-clock 
time is spent on tree-traversion with 9 = 0.75. This allows for block time-step execu- 
tion where the tree-traverse time is reduced by a factor N/N act i vc , where Active is the 
number of particles that have to be updated. 



3. Discussion and Conclusions 

We have presented an efficient gravitational A-body tree-code. In contrast to other 
existing GPU tree-codes, this implementation is executed completely on the GPU. On 



2 The code is publicly available at: 

http : // castle . strw . leidenuniv . nl/sof tware . html 
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Figure 1 . Wall-clock time spent by the CPU and the GPU on various primitive 
algorithms. The bars show the time spent on the five selected sections of code. The 
results indicate that our GPU code outperforms the CPU code on all fronts and is 
between 2 and 30 times faster. Note that the y-axis is in logscale. (Timings using a 
2 20 million body Plummer sphere with 6 = 0.75) 
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Figure 2. The wall-clock time spent by various parts of the program versus the 
number of particles N. We used Plummer models as initial conditions and varied the 
number of particles over two orders of magnitude. The scaling of the tree-walk is 
between O(N) (shown with the black solid line) and the theoretical 0(N log AO and 
is due to the average number of interactions staying roughly constant (see inset). The 
asymptotic complexity of the tree-construction approaches O(N), as expected, since 
all the constituent primitives share the same complexity. The timings are from the 
GTX480 GPU with 6 = 0.75. 
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a GTX480 the number of particles processed per unit time is 2.8 million particles per 
second with 9 = 0.75. This allows us to routinely carry out simulations on the GPU. 
Since the current version can only use 1 GPU, the limitation is the amount of memory. 
For 5 million particles ~ 1 gigabyte of GPU memory is required. 

Even though the sorting, moving and tree-construction parts of the code take up 
roughly 10% of the execution time in the presented timings, these methods do not have 
to be executed during each time-step when using the block time-step method. It is suf- 
ficient to only recompute the multipole moments of tree-cells that have updated child 
particles, and only when the tree-traverse shows a considerable decline in performance 
does the complete tree-structure has to be rebuild. This decline is the result of ineffi- 
cient memory reads and an increase of the average number of particle-cell and particle- 
particle interactions. This quantity increases because the tree-cell size increases, which 
causes more cells to be opened by the MAC. 

Although the implemented algorithms are designed for a shared-memory architec- 
ture, they can be used to constru ct and traverse tree - structu r es on para l lel GP U clusters 



using the methods described in Wa rren & Salmo n (1993); iDubins ki (1996). Further 



more, in case of a parallel GPU tree-code, the CPU can exchange particles with the 
other nodes, while the GPU is traversing the tree-structure of the local data, making it 
possible to hide most of the communication time. 

The implemented algorithms are not limited to the evaluation of gravitational 
forces, but can be applied to a variety of problems, such as neighbour search, clump 
finding algorithms, fast multipole method and ray tracing. In particular, it is straightfor- 
ward to implement Smoothed Particle Hydrodynamics in such a code, therefore having 
a self-gravitating particle based hydrodynamics code implemented on the GPU. 
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